
Python set usage benchmark

For this article, I will be using Python 3.6 and IPython 6.2.1.

Before starting on the technical side, let's quickly present Python sets. If you want the full documentation, don't hesitate to go to the official documentation.

Sets are Python data structures that can contain any hashable object. For each element, a hash is computed. Consequently, there can't be duplicate values, as two equal values would get the same hash. It also means that accessing an object in a set should be close to O(1), although it can reach O(n) in the worst case.

This makes sets extremely efficient for membership testing. Contrary to a dictionary, a value in a set doesn't point to another value.

There are two ways of creating a set:

In [1]: {"a","b"}
Out[1]: {'a', 'b'}

In [2]: set(["a","b"])
Out[2]: {'a', 'b'}
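As a quick aside, note that the literal syntax cannot produce an empty set: {} creates an empty dictionary, so set() is the way to get an empty set.

In [3]: type({})
Out[3]: dict

In [4]: type(set())
Out[4]: set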

To test the membership of an object, you can just use the usual in operator.

In [1]: "a" in {"a","b"}
Out[1]: True

What is the cost of creating a set? To test it, we generate a list and then create a set from it. We use the timeit magic of IPython.

In [1]: r = list(range(0,10))

In [2]: %timeit set(r)
560 ns ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

We plot the timings for both set and list creation, as they will be important for the following experiments.
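The plots themselves were produced with %timeit; a minimal sketch of the same measurement with the timeit module is shown below (the sizes and iteration counts are illustrative assumptions, not the exact values used for the plots).

from timeit import timeit as ti

# Rough sketch: time set() and list() creation from an existing list
# for a few sizes (sizes and iteration count are illustrative).
for size in [10, 100, 1000, 10000, 100000]:
    r = list(range(size))
    t_set = ti("set(r)", globals={"r": r}, number=1000) / 1000
    t_list = ti("list(r)", globals={"r": r}, number=1000) / 1000
    print("{} elements: set {:.2e}s, list {:.2e}s".format(size, t_set, t_list))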

Here are the results:

We can see that, for a relatively low number of elements, the time needed to create a set grows much faster than the time needed to create a list of the same size. This is expected, as the set needs to compute hashes and manage collisions.

For a really large number of elements, both seem to tend toward the same time. This is most likely due to the memory reallocation needed while growing both data structures: the time spent reallocating memory becomes predominant in the creation process.

So now the question: based on the creation time of both structures, when is it more interesting to do:

if 1 in {1,2,3,4}:
    pass

rather than doing:

if 1 in [1,2,3,4]:
    pass

We are going to test three cases here. For each one, we instantiate the data structure and then do the membership test.

In the case of the list, as Python may need to scan through all of it, we consider two cases: the best case, when the item we are looking for is the first item of the list, and the worst case, when the item is not in the list at all.
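The exact benchmarking code for these plots is not shown in the article; a minimal sketch of how the three cases could be timed is given below. To keep the construction cost inside the timed statement, the structures are built from an existing list rather than from a literal (names and sizes are illustrative assumptions).

from timeit import timeit as ti

n = 1000
r = list(range(n))
cases = {
    "set, item present": "0 in set(r)",    # build the set, then test
    "list, best case":   "0 in list(r)",   # item is the first element
    "list, worst case":  "-1 in list(r)",  # item is absent
}
for label, stmt in cases.items():
    t = ti(stmt, globals={"r": r}, number=1000) / 1000
    print("{}: {:.2e}s".format(label, t))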

Here are the results obtained.

From the experiment, we can quickly see that instantiating a set just for a condition is not a good idea. The list's best case is always much better than the set's, and its worst case is never far from it.

There is very little to gain from using a set in a condition if you instantiate it only for that test.

So when is a set becoming more interesting than a list for membership testing?

As a set is slower to build than a list, sets become interesting only when several membership tests are performed on them.

Let's take a look at the time cost of membership testing for both of them, using the same methodology as in the previous experiment.
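Again, the exact code is not shown in the article; a minimal sketch of the membership-only comparison, with the structures built outside the timed statement, could look like this (sizes and iteration counts are illustrative):

from timeit import timeit as ti

for size in [10, 100, 1000, 10000]:
    data_list = list(range(size))
    data_set = set(data_list)
    t_set = ti("-1 in s", globals={"s": data_set}, number=100000) / 100000
    t_list_best = ti("0 in l", globals={"l": data_list}, number=100000) / 100000
    t_list_worst = ti("-1 in l", globals={"l": data_list}, number=100000) / 100000
    print("{} elements: set {:.2e}s, list best {:.2e}s, list worst {:.2e}s".format(
        size, t_set, t_list_best, t_list_worst))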

The difference is small, but set membership testing is always faster than the list's. We can see here that the O(1) complexity is achieved. It is the same complexity as testing the first element of the list, so the two stay close. But against the list's worst case, where the complexity is O(n), the set wins by a wide margin.

So if there are multiple membership tests, the set becomes more interesting; the question is how many tests you need before it does.

Compared with the list's best case, it seems the set becomes interesting once the number of membership tests is about twice the number of elements. Compared with the worst case, if we take the deviation into account, the set seems to become interesting after the very first test.
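One way to estimate this break-even point from the measured times is to model the total cost as the build cost plus k lookups for each structure and solve for k. A rough sketch of this idea (the linear model and the names are assumptions for illustration, not the article's method):

from timeit import timeit as ti

n = 1000
r = list(range(n))

# Build costs.
build_set = ti("set(r)", globals={"r": r}, number=1000) / 1000
build_list = ti("list(r)", globals={"r": r}, number=1000) / 1000

# Per-lookup costs on already-built structures.
s, l = set(r), list(r)
lookup_set = ti("-1 in s", globals={"s": s}, number=100000) / 100000
lookup_list_best = ti("0 in l", globals={"l": l}, number=100000) / 100000
lookup_list_worst = ti("-1 in l", globals={"l": l}, number=100000) / 100000

# Smallest k such that build_set + k * lookup_set < build_list + k * lookup_list.
# If the per-test gain is not positive, the set never breaks even in that scenario.
for label, lookup_list in [("best case", lookup_list_best), ("worst case", lookup_list_worst)]:
    gain_per_test = lookup_list - lookup_set
    if gain_per_test > 0:
        k = (build_set - build_list) / gain_per_test
        print("break-even vs list {}: ~{:.0f} tests".format(label, k))
    else:
        print("break-even vs list {}: never".format(label))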

But sets can be used in many different ways. I am not going to compare the efficiency of their dedicated methods, like union or difference, here.

To deduplicate a list, it is not rare to see:

set(my_list)

How efficient is it to use a set from scratch? Let's consider two cases: when all the elements are different, and when they are all identical.


The following code gathers the information we need. We will use three functions: the first one tests deduplication with a set, the second one does the same but converts the set to a list at the end, and the last one is the most straightforward pure-list deduplication, which checks for duplicates before insertion.

from timeit import timeit as ti


def deduplicate_set(elements):
    my_set = set()

    # Simulating the progressive addition of values
    for element in elements:
        my_set.add(element)
        # That's all: sets don't have duplicated values.


def deduplicate_set_final_list(elements):
    # In this case we also take into account the
    # fact that someone wants a list in the end.
    my_set = set()

    # Simulating the progressive addition of values.
    for element in elements:
        my_set.add(element)
    # Conversion of the set into a list.
    list(my_set)


def deduplicate_list_pre(elements):
    my_list = list()

    # Simulating the progressive addition of values.
    for element in elements:
        # We avoid duplication by checking before insertion.
        if element not in my_list:
            my_list.append(element)


# To avoid transient skew, we average over several iterations.
nb_iter = 1000

tests_to_do = [10, 100, 500, 1000]
for value_to_test in tests_to_do:
    elements = list(range(value_to_test))
    t1 = ti("deduplicate_set(elements)",
            setup="from __main__ import elements, deduplicate_set",
            number=int(nb_iter)) / nb_iter
    t2 = ti("deduplicate_set_final_list(elements)",
            setup="from __main__ import elements, deduplicate_set_final_list",
            number=int(nb_iter)) / nb_iter
    t3 = ti("deduplicate_list_pre(elements)",
            setup="from __main__ import elements, deduplicate_list_pre",
            number=int(nb_iter)) / nb_iter
    print("results for {} elements".format(value_to_test))
    print(t1)
    print(t2)
    print(t3)

All elements are different:

This was predictable: the in test on a list is O(n), which makes this list deduplication O(n²) overall. In this case, the set shines like a thousand suns. Even with a really low number of elements, it is faster than the list.

Now let's try with all the elements being identical.

for value_to_test in tests_to_do:
    elements = [1] * value_to_test
    t1 = ti("deduplicate_set(elements)",
            setup="from __main__ import elements, deduplicate_set",
            number=int(nb_iter)) / nb_iter
    t2 = ti("deduplicate_set_final_list(elements)",
            setup="from __main__ import elements, deduplicate_set_final_list",
            number=int(nb_iter)) / nb_iter
    t3 = ti("deduplicate_list_pre(elements)",
            setup="from __main__ import elements, deduplicate_list_pre",
            number=int(nb_iter)) / nb_iter
    print("results for {} elements".format(value_to_test))
    print(t1)
    print(t2)
    print(t3)

All elements are identical:

When the element to check is always the first one in the list, the list is faster than the set. It is a case that I guess would rarely happen, but it is good to know.

The difference is much smaller than in the previous test. If you can predict how your data is going to look, you will be able to make the right choice. Otherwise, the set seems to be the way to go.
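As a side note, if you want both the set's O(1) lookups and a list as the final result, with the original order preserved, a common idiom is to combine the two. This variant was not benchmarked here; a minimal sketch:

def deduplicate_keep_order(elements):
    # A "seen" set gives O(1) duplicate checks while the list keeps the order.
    seen = set()
    result = []
    for element in elements:
        if element not in seen:
            seen.add(element)
            result.append(element)
    return result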

To conclude this musing session, I must admit this is far from what I imagined. I thought the set would be much faster to build and would be a handy tool in many cases, but it seems this drawback limits its use.

By PXke, 10/05/2018. Tags: Python, Set, Performance. Updated: 10/05/2018.